# Multimodal processing
Gemma 3n E2B It Unsloth Bnb 4bit
Gemma 3n-E2B-it is a lightweight open-source multimodal model launched by Google, built on the same technology as Gemini and optimized for low-resource devices.
Image-to-Text
Transformers English

unsloth
4,914
2
Gemma 3n E2B
Gemma 3n is a lightweight, state-of-the-art open-source model family launched by Google, supporting multimodal input and text output.
Image-to-Text
Transformers

google
206
11
Gemma 3n E4B It
Gemma 3n is a lightweight, state-of-the-art open-source multimodal model family launched by Google. It is built on the same research and technology as the Gemini models and supports text, audio, and visual inputs.
Image-to-Text
Transformers

google
1,690
81
Nuextract 2.0 4B
MIT
NuExtract 2.0 is a series of multimodal models specifically trained for structured information extraction tasks. It supports text and image inputs and has multilingual processing capabilities.
Image-to-Text
Transformers

numind
272
3
Google.gemma 3 4b It Qat Int4 Unquantized GGUF
A quantized image-to-text model based on Gemma 3 4B, published with the aim of making knowledge accessible to the public.
Image-to-Text
DevQuasar
161
1
Gemma 3 4b It Qat Autoawq
Gemma 3 is a lightweight open-source multimodal model launched by Google, built on Gemini technology, supporting text and image input and generating text output.
Image-to-Text
Safetensors
gaunernst
503
1
Smoldocling 256M Preview Mlx Fp16
Apache-2.0
This model is converted from ds4sd/SmolDocling-256M-preview to the MLX format, supporting image-text-to-text tasks.
Image-to-Text
Transformers English

ahishamm
24
1
Gemma 3 27b Pt Bnb 4bit
Gemma 3 is a lightweight open model series launched by Google, built on the same research and technology as the Gemini models, supporting multimodal input and text output.
Image-to-Text
Transformers English

unsloth
2,009
1
Gemma 3 1b Pt Unsloth Bnb 4bit
Gemma 3 is a series of lightweight open models launched by Google, supporting multimodal input (text and images) and a large 128K context window, suitable for tasks such as question answering and summarization.
Image-to-Text
Transformers English

unsloth
4,481
3
Kaleidoscope Large V1
A model specialized for document Q&A, fine-tuned from sberbank-ai/ruBert-large and supporting Russian and English.
Question Answering System
Transformers Supports Multiple Languages

2KKLabs
214
2
Kaleidoscope Large V1
A document QA model fine-tuned from sberbank-ai/ruBert-large, excelling at extracting answers from documents, supporting Russian and English.
Question Answering System
Transformers Supports Multiple Languages

LaciaStudio
297
0
Kaleidoscope Small V1
A document question-answering model fine-tuned from sberbank-ai/ruBert-base, excelling at extracting answers from document contexts and supporting Russian and English.
Question Answering System
Transformers Supports Multiple Languages

2KKLabs
98
0
Ola Image
Apache-2.0
Ola-7B is a multimodal language model jointly developed by Tencent, Tsinghua University, and Nanyang Technological University, based on the Qwen2.5 architecture. It supports processing image, video, audio, and text inputs and outputs text.
Multimodal Fusion
Safetensors Supports Multiple Languages
THUdyh
61
3
Mineru
Apache-2.0
This model converts PDF documents into Markdown format while preserving the original document layout structure and accurately recognizing mathematical formulas and tables.
Image-to-Text
Transformers Supports Multiple Languages

kitjesen
122
12
Pixtral 12b Nf4
Apache-2.0
A 4-bit quantized version of the Mistral community's Pixtral-12B, focused on image-text-to-text tasks and supporting description generation in Chinese.
Image-to-Text
Transformers

SeanScripts
236
20
Florence 2 DocVQA
A version of Microsoft's Florence-2 model fine-tuned for one day on 5% of the Docmatix dataset with a learning rate of 1e-6.
Image-to-Text
Transformers

HuggingFaceM4
3,096
60
Kosmos 2 PokemonCards Trl Merged
A multimodal model fine-tuned from Microsoft's Kosmos-2, designed specifically to recognize Pokemon names on Pokemon cards.
Image-to-Text
Transformers English

Mit1208
51
1
Cellseg Sribd
Apache-2.0
A cell segmentation model developed by the Sribd-med team, suited to cell instance segmentation in multimodal images.
Image Segmentation
Transformers English

Lewislou
23
0
Donut Base Finetuned Latvian Receipts V2
MIT
A model based on the Donut architecture, fine-tuned specifically on Latvian receipt data.
Text Recognition
Transformers

Inesence
13
0
S2t Small Mustc En De St
MIT
A speech-to-text Transformer model trained for end-to-end English-to-German speech translation.
Speech Recognition
Transformers Supports Multiple Languages

facebook
156
0
S2t Small Mustc En Ro St
MIT
A Transformer-based end-to-end speech translation model designed for English-to-Romanian speech translation.
Speech Recognition
Transformers Supports Multiple Languages

facebook
19
0
S2t Small Mustc En Fr St
MIT
An end-to-end English-to-French speech translation model based on the S2T architecture, trained on the MuST-C dataset.
Speech Recognition
Transformers Supports Multiple Languages

facebook
2,326
2